Apache Hadoop Distributed Copy – DistCp Guide



2023-08-08 15:35 | Source: compiled from the web

DistCp Guide

Overview
Usage
    Basic Usage
    Update and Overwrite
    Sync
Command Line Options
Architecture of DistCp
    DistCp Driver
    Copy-listing Generator
    InputFormats and MapReduce Components
Appendix
    Map sizing
    Copying Between Versions of HDFS
    MapReduce and other side-effects
Frequently Asked Questions

Overview

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

[The erstwhile implementation of DistCp](http://hadoop.apache.org/docs/r1.2.1/distcp.html) has its share of quirks and drawbacks in its usage, extensibility, and performance. The purpose of the DistCp refactor was to fix these shortcomings and to allow DistCp to be used and extended programmatically. New paradigms have been introduced to improve runtime and setup performance, while retaining the legacy behaviour as the default.

This document describes the design of the new DistCp, its new features, their optimal use, and any deviation from the legacy implementation.

Usage

Basic Usage

The most common invocation of DistCp is an inter-cluster copy:

    bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each NodeManager from nn1 to nn2.

One can also specify multiple source directories on the command line:

    bash$ hadoop distcp hdfs://nn1:8020/foo/a \
        hdfs://nn1:8020/foo/b \
        hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

    bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
        hdfs://nn2:8020/bar/foo

Where srclist contains

    hdfs://nn1:8020/foo/a
    hdfs://nn1:8020/foo/b
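As a minimal sketch of preparing such a list, the script below writes one source URI per line to a local scratch file and prints the commands that would upload and use it. The helper names `make_srclist` and `print_copy_commands` are hypothetical, and the commands are only printed so they can be reviewed before being run against a real cluster.

```shell
#!/usr/bin/env bash
# Sketch: write a source list and show the commands that would use it.
# make_srclist and print_copy_commands are hypothetical helper names.

# Write one fully qualified source URI per line into the given local file.
make_srclist() {
    local out="$1"; shift
    printf '%s\n' "$@" > "$out"
}

# Print (do not run) the upload of the list and the DistCp invocation.
print_copy_commands() {
    local local_list="$1" list_uri="$2" dest_uri="$3"
    echo "hdfs dfs -put -f $local_list $list_uri"
    echo "hadoop distcp -f $list_uri $dest_uri"
}

list_file="$(mktemp)"
make_srclist "$list_file" hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b
print_copy_commands "$list_file" hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```

Keeping the list in a file is mostly a convenience when the number of sources is large or generated by another tool.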

When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the options specified. By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job, but it may be inaccurate if a copier failed for some subset of its files, but succeeded on a later attempt.

It is important that each NodeManager can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol; see [Copying Between Versions](#Copying_Between_Versions_of_HDFS).

After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in DistCp, Map/Reduce, or the FileSystem API, or in the interaction between any of them, could adversely and silently affect the copy. Some have had success running with -update enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.
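One way to cross-check the listings is sketched below. It assumes the usual eight-column layout of `hdfs dfs -ls -R` output (permissions, replication, owner, group, size, date, time, path) and paths without spaces. `normalize_listing` is a hypothetical helper, not part of Hadoop.

```shell
#!/usr/bin/env bash
# Sketch: compare recursive listings of source and destination by
# relative path and size. normalize_listing is a hypothetical helper.

# Reads `hdfs dfs -ls -R` output on stdin and prints "relative_path size"
# for regular files only, sorted, with the given root prefix removed.
normalize_listing() {
    local root="${1%/}"
    awk -v root="$root" '
        $1 ~ /^-/ {                                  # regular files start with "-"
            path = $8
            sub("^[a-z0-9]+://[^/]+", "", path)      # drop scheme://authority if present
            if (index(path, root "/") == 1)
                path = substr(path, length(root) + 2)
            print path, $5
        }' | sort
}

# Example use against live clusters (not run here):
#   diff <(hdfs dfs -ls -R hdfs://nn1:8020/foo/bar | normalize_listing /foo/bar) \
#        <(hdfs dfs -ls -R hdfs://nn2:8020/bar/foo | normalize_listing /bar/foo)
```

An empty diff only shows that names and lengths agree. Where content must be verified as well, comparing the output of `hdfs dfs -checksum` for each file is a stronger, though slower, check.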

It’s also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a FileNotFoundException.

Please refer to the detailed Command Line Reference for information on all the options available in DistCp.

Update and Overwrite

-update copies files from the source that either don’t exist at the target or differ from the target version. -overwrite overwrites files that already exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:

    hdfs://nn1:8020/source/first/1
    hdfs://nn1:8020/source/first/2
    hdfs://nn1:8020/source/second/10
    hdfs://nn1:8020/source/second/20

When DistCp is invoked without -update or -overwrite, the DistCp defaults would create directories first/ and second/, under /target. Thus:

distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

    hdfs://nn2:8020/target/first/1
    hdfs://nn2:8020/target/first/2
    hdfs://nn2:8020/target/second/10
    hdfs://nn2:8020/target/second/20

When either -update or -overwrite is specified, the contents of the source-directories are copied to target, and not the source directories themselves. Thus:

distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

    hdfs://nn2:8020/target/1
    hdfs://nn2:8020/target/2
    hdfs://nn2:8020/target/10
    hdfs://nn2:8020/target/20

By extension, if both source folders contained a file with the same name (say, 0), then both sources would map an entry to /target/0 at the destination. Rather than permit this conflict, DistCp will abort.
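The path rule described above can be mimicked with a few lines of shell. This is an illustration only, not DistCp's code: `map_target` is a hypothetical helper, and it assumes plain paths and the multi-source case shown in the examples.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the destination-path rule described above.
# This is not DistCp's code; map_target is a hypothetical helper.

# map_target MODE SOURCE_DIR FILE TARGET_DIR
#   MODE "default": the source directory itself is recreated under the target.
#   MODE "update" or "overwrite": only the directory contents are copied.
map_target() {
    local mode="$1" src="${2%/}" file="$3" target="${4%/}"
    local rel="${file#"$src"/}"          # path of FILE relative to SOURCE_DIR
    case "$mode" in
        default)          echo "$target/$(basename "$src")/$rel" ;;
        update|overwrite) echo "$target/$rel" ;;
        *)                echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

map_target default /source/first /source/first/1 /target
map_target update  /source/first /source/first/1 /target

# Collision check: print any destination path produced by more than one source.
{
    map_target update /source/first  /source/first/0  /target
    map_target update /source/second /source/second/0 /target
} | sort | uniq -d
```

Running the same check over a real source listing before a copy with -update or -overwrite would show which names collide, which is the situation in which DistCp aborts.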

Now, consider the following copy operation:

distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

With sources/sizes:

    hdfs://nn1:8020/source/first/1    32
    hdfs://nn1:8020/source/first/2    32
    hdfs://nn1:8020/source/second/10  64
    hdfs://nn1:8020/source/second/20  32

And destination/sizes:

    hdfs://nn2:8020/target/1   32
    hdfs://nn2:8020/target/10  32
    hdfs://nn2:8020/target/20  64

Will effect:

    hdfs://nn2:8020/target/1   32
    hdfs://nn2:8020/target/2   32
    hdfs://nn2:8020/target/10  64
    hdfs://nn2:8020/target/20  32

If -update is used, 1 is skipped because the file-length and contents match. 2 is copied because it doesn’t exist at the target. 10 and 20 are overwritten since the contents don’t match the source. However, if -append is additionally used, then only 20 is overwritten (source length less than destination) and 10 is appended with the change in file (if the files match up to the destination’s original length).

If -overwrite is used, 1 is overwritten as well.
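The outcomes described above can be summarised as a small decision function. The sketch below is an illustration, not DistCp's implementation: `decide_action` is a hypothetical helper, and the real comparison of contents (checksums) is reduced to a yes/no flag supplied by the caller.

```shell
#!/usr/bin/env bash
# Illustration only: the skip/copy/overwrite/append outcomes described
# above, reduced to file lengths plus a flag saying whether contents match.
# Not DistCp's implementation; decide_action is a hypothetical helper.

# decide_action MODE SRC_LEN DST_LEN MATCH
#   MODE:    update | update-append | overwrite
#   DST_LEN: empty string when the file is absent at the target
#   MATCH:   yes when the contents agree over the shorter of the two lengths
decide_action() {
    local mode="$1" src_len="$2" dst_len="$3" match="$4"
    if [ -z "$dst_len" ]; then echo copy; return; fi
    if [ "$mode" = overwrite ]; then echo overwrite; return; fi
    if [ "$src_len" -eq "$dst_len" ] && [ "$match" = yes ]; then
        echo skip
    elif [ "$mode" = update-append ] && [ "$src_len" -gt "$dst_len" ] && [ "$match" = yes ]; then
        echo append
    else
        echo overwrite
    fi
}

decide_action update        32 32 yes   # same length and contents (file 1)
decide_action update        32 "" no    # absent at the target (file 2)
decide_action update-append 64 32 yes   # source longer, common part matches
decide_action update-append 32 64 yes   # source shorter than destination
decide_action overwrite     32 32 yes   # -overwrite replaces even identical files
```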

Sync

-diff option syncs files from a source cluster to a target cluster with a snapshot diff. It copies, renames and removes files in the snapshot diff list.

-update option must be included when -diff option is in use.

Most cloud providers don’t work well with sync at the moment.

Usage:

hadoop distcp -update -diff <from_snapshot> <to_snapshot> <source> <destination>

Example:

hadoop distcp -update -diff snap1 snap2 /src/ /dst/

The command above applies changes from snapshot snap1 to snap2 (i.e. the snapshot diff from snap1 to snap2) in /src/ to /dst/. It requires /src/ to have both snapshots snap1 and snap2. The destination /dst/ must also have a snapshot with the same name as <from_snapshot>, in this case snap1, and should not have had new file operations (create, rename, delete) since snap1. Note that when this command finishes, a new snapshot snap2 will NOT be created at /dst/.

-update is required to use -diff option.

For instance, in /src/, if 1.txt is added and 2.txt is deleted after the creation of snap1 and before creation of snap2, the command above will copy 1.txt from /src/ to /dst/ and delete 2.txt from /dst/.
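To preview what a sync would do, one could inspect the snapshot diff report first. The sketch below turns such a report into a rough action list. `plan_from_diff` is a hypothetical helper, and the report layout (a one-letter type followed by the path, with renames shown as `old -> new`) is an assumption about the output of `hdfs snapshotDiff`. Entries marked as modified are left out of this simplified plan.

```shell
#!/usr/bin/env bash
# Sketch: turn a snapshot diff report into a rough list of the actions a
# sync would apply at the destination. plan_from_diff is a hypothetical
# helper; the report layout is an assumption (see the text above).

plan_from_diff() {
    awk '
        $1 == "+" { print "copy", $2 }          # created after the first snapshot
        $1 == "-" { print "delete", $2 }        # deleted after the first snapshot
        $1 == "R" { print "rename", $2, $4 }    # reported as "old -> new"
    '
}

# Sample report matching the 1.txt / 2.txt case described above.
printf 'M\t.\n+\t./1.txt\n-\t./2.txt\n' | plan_from_diff

# Against a live cluster (not run here):
#   hdfs snapshotDiff /src/ snap1 snap2 | plan_from_diff
#   hadoop distcp -update -diff snap1 snap2 /src/ /dst/
```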

Sync behavior will be elaborated using experiments below.

Experiment 1: Syncing diff of two adjacent snapshots

Some preparations before we start.

    # Create source and destination directories
    hdfs dfs -mkdir /src/ /dst/
    # Allow snapshot on source
    hdfs dfsadmin -allowSnapshot /src/
    # Create a snapshot (empty one)
    hdfs dfs -createSnapshot /src/ snap1
    # Allow snapshot on destination
    hdfs dfsadmin -allowSnapshot /dst/
    # Create a from_snapshot with the same name
    hdfs dfs -createSnapshot /dst/ snap1

    # Put one text file under /src/
    echo "This is the 1st text file." > 1.txt
    hdfs dfs -put 1.txt /src/
    # Create the second snapshot
    hdfs dfs -createSnapshot /src/ snap2

    # Put another text file under /src/
    echo "This is the 2nd text file." > 2.txt
    hdfs dfs -put 2.txt /src/
    # Create the third snapshot
    hdfs dfs -createSnapshot /src/ snap3

Then we run distcp sync:

hadoop distcp -update -diff snap1 snap2 /src/ /dst/

The command above should succeed. 1.txt will be copied from /src/ to /dst/. Again, -update option is required.

If we run the same command again, we will get a "DistCp sync failed" exception, because the destination has gained a new file, 1.txt, since snap1. That being said, if we manually remove 1.txt from /dst/ and run the sync again, the command will succeed.

Experiment 2: syncing diff of two non-adjacent snapshots

First do a clean up from Experiment 1.

hdfs dfs -rm -skipTrash /dst/1.txt

Run the sync command; note that <to_snapshot> has been changed from snap2 in Experiment 1 to snap3.

hadoop distcp -update -diff snap1 snap3 /src/ /dst/

Both 1.txt and 2.txt will be copied to /dst/.

Experiment 3: syncing file delete operation

Continuing from the end of Experiment 2:

    hdfs dfs -rm -skipTrash /dst/2.txt
    # Create snap2 at destination, it contains 1.txt
    hdfs dfs -createSnapshot /dst/ snap2
    # Delete 1.txt from source
    hdfs dfs -rm -skipTrash /src/1.txt
    # Create snap4 at source, it only contains 2.txt
    hdfs dfs -createSnapshot /src/ snap4

Run sync command now:

hadoop distcp -update -diff snap2 snap4 /src/ /dst/

2.txt is copied and 1.txt is deleted under /dst/.

Note that, although both /src/ and /dst/ have a snapshot with the same name snap2, the snapshots don’t need to have the same content. That means that even if the 1.txt in /dst/’s snap2 has different contents from the one in /src/, 1.txt will still be removed from /dst/. The sync command doesn’t check the contents of the files that are going to be deleted. It simply follows the snapshot diff list between <from_snapshot> and <to_snapshot>.

Also, if we delete 1.txt from /dst/ before creating snap2 on /dst/ in the steps above, so that /dst/’s snap2 doesn’t contain 1.txt when the sync command runs, the command will still succeed. It won’t throw an exception while trying to delete a 1.txt that doesn’t exist in /dst/.
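Because a sync does not create the new snapshot at the destination, repeated syncs need that snapshot to be created by hand, as Experiment 3 does. The sketch below prints the commands for one such round. `sync_round` is a hypothetical helper, and the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: print the commands for one round of incremental sync, ending with
# a destination snapshot so the next round has a matching starting point.
# sync_round is a hypothetical helper; commands are printed, not executed.

# sync_round FROM_SNAPSHOT TO_SNAPSHOT SOURCE DESTINATION
sync_round() {
    local from="$1" to="$2" src="$3" dst="$4"
    echo "hdfs dfs -createSnapshot $src $to"
    echo "hadoop distcp -update -diff $from $to $src $dst"
    echo "hdfs dfs -createSnapshot $dst $to"
}

sync_round snap1 snap2 /src/ /dst/
sync_round snap2 snap3 /src/ /dst/
```

Each round assumes the destination has not been modified since its copy of the starting snapshot, which is the precondition the experiments above illustrate.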

raw Namespace Extended Attribute Preservation

This section only applies to HDFS.

If the target and all of the source pathnames are in the /.reserved/raw hierarchy, then ‘raw’ namespace extended attributes will be preserved. ‘raw’ xattrs are used by the system for internal functions such as encryption metadata. They are only visible to users when accessed through the /.reserved/raw hierarchy.

raw xattrs are preserved based solely on whether /.reserved/raw prefixes are supplied. The -p (preserve, see below) flag does not impact preservation of raw xattrs.

To prevent raw xattrs from being preserved, simply do not use the /.reserved/raw prefix on any of the source and target paths.

If the /.reserved/raw prefix is specified on only a subset of the source and target paths, an error will be displayed and a non-0 exit code returned.
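The all-or-nothing rule above can be checked before launching a copy. The sketch below is a hypothetical pre-flight check, not something DistCp provides: `check_raw_prefix` only inspects plain paths (without a scheme or authority) and reports whether the prefix is used consistently.

```shell
#!/usr/bin/env bash
# Sketch: check that the /.reserved/raw prefix is used on all paths or on
# none, mirroring the rule above. check_raw_prefix is a hypothetical helper
# and only inspects plain paths (no scheme://authority part).

check_raw_prefix() {
    local raw=0 plain=0 p
    for p in "$@"; do
        case "$p" in
            /.reserved/raw|/.reserved/raw/*) raw=$((raw + 1)) ;;
            *)                               plain=$((plain + 1)) ;;
        esac
    done
    if [ "$raw" -gt 0 ] && [ "$plain" -gt 0 ]; then
        echo "mixed"
        return 1
    elif [ "$raw" -gt 0 ]; then
        echo "raw xattrs preserved"
    else
        echo "raw xattrs not preserved"
    fi
}

check_raw_prefix /.reserved/raw/src /.reserved/raw/dst
check_raw_prefix /src /dst
check_raw_prefix /.reserved/raw/src /dst || echo "DistCp would report an error here"
```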

Command Line Options

| Flag | Description | Notes |
|---|---|---|
| -p[rbugpcaxt] | Preserve. r: replication number, b: block size, u: user, g: group, p: permission, c: checksum-type, a: ACL, x: XAttr, t: timestamp | When -update is specified, status updates will not be synchronized unless the file sizes also differ (i.e. unless the file is re-created). If -pa is specified, DistCp preserves the permissions also because ACLs are a super-set of permissions. The option -pr is only valid if both source and target directory are not erasure coded. |
| -i | Ignore failures | As explained in the Appendix, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. |
| -log <logdir> | Write logs to <logdir> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed. |
| -v | Log additional info (path, size) in the SKIP/COPY log | This option can only be used with -log option. |
| -m <num_maps> | Maximum number of simultaneous copies | Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. |
| -overwrite | Overwrite destination | If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -update | Overwrite if source and destination differ in size, blocksize, or checksum | As noted in the preceding, this is not a “sync” operation. The criteria examined are the source and destination file sizes, blocksizes, and checksums; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -append | Incremental copy of file with same name but different length | If the source file is greater in length than the destination file, the checksum of the common length part is compared. If the checksum matches, only the difference is copied using read and append functionalities. The -append option only works with -update without -skipcrccheck. |
| -f <urilist_uri> | Use list at <urilist_uri> as src list | This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI. |
| -filters | The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. | Supports regular expressions specified by java.util.regex.Pattern. |
| -filelimit | Limit the total number of files to be | |
